ADS Lunch Lecture Series
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(boot)     # Bootstrapping methods
library(mice)     # Dataset
library(DAAG)     # CV methods
If we take a simple model in its general notation
\[y=\alpha+\beta x+\epsilon,\]
which, because \(\mathbb{E}[\epsilon] = 0\), implies \[\mathbb{E}[y] = \alpha + \beta x,\]
and consequently \[y = \mathbb{E}[y] + \epsilon.\]
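To make this notation concrete, we can simulate data from the model and recover \(\alpha\) and \(\beta\) with `lm()`. This is a minimal sketch: the values \(\alpha = 3\), \(\beta = 0.5\) and the sample size are illustrative choices, not taken from the lecture.

```r
# Sketch: simulate y = alpha + beta*x + epsilon and recover the parameters.
# alpha = 3, beta = 0.5 and n = 1000 are illustrative assumptions.
set.seed(123)
n <- 1000
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)  # epsilon ~ N(0, 1)
coef(lm(y ~ x))              # estimates should land close to (3, 0.5)
```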
At the end of the day, we still have a model with estimated parameters.
George Box correctly noted that \[\text{ALL MODELS ARE WRONG, BUT SOME ARE USEFUL}\]
He also stated \[\text{HOW WRONG DO THEY NEED TO BE TO NOT BE USEFUL?}\]
These remarks are important, because the implications of being wrong are not obvious to everyone.
In general, the only correct model is the true data generating model.
In modeling, the data are used to obtain parameter estimates, and thereby fitted values for the observed data
\[\text{A MODEL THAT FITS ONLY ON A GIVEN SET OF DATA}\\ \text{IS NOT USEFUL TO ANYTHING OUTSIDE THAT SET}\]
In cross-validation, the data are divided into two distinct sets: a training set and a test (validation) set:
Validation set CV: randomly divide the data into two sets; fit the model on one set and evaluate it on the other.
Leave One Out CV (LOOCV): Fit the model on all cases, except one. Evaluate on that case. Repeat for all other cases. Average.
k-Fold CV: Leave more than one case out at a time, such that only \(k<n\) model fits and evaluations need to be done.
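The schemes above can be made concrete with a small base-R sketch of k-fold CV. The choice \(k = 5\), the seed, and the use of the built-in `anscombe` data are illustrative assumptions:

```r
# Hedged sketch of 5-fold CV for lm(y1 ~ x1) on the anscombe data.
set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(anscombe)))  # random fold labels
mse   <- numeric(k)
for (j in 1:k) {
  train  <- anscombe[folds != j, ]   # fit on all folds except j
  test   <- anscombe[folds == j, ]   # evaluate on fold j
  fit_j  <- lm(y1 ~ x1, data = train)
  mse[j] <- mean((test$y1 - predict(fit_j, newdata = test))^2)
}
mean(mse)   # cross-validated mean squared error
```

With \(k = n\) this reduces to LOOCV; with \(k = 2\) it resembles validation-set CV with the roles of the halves swapped.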
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
Anscombe, Francis J. (1973). Graphs in statistical analysis.
American Statistician, 27, 17–21.
colMeans(anscombe)
##       x1       x2       x3       x4       y1       y2       y3       y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
anscombe %>% var %>% diag
##        x1        x2        x3        x4        y1        y2        y3
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620
##        y4
##  4.123249
anscombe %$% lm(y1 ~ x1) %>% coef
## (Intercept)          x1
##   3.0000909   0.5000909
anscombe %$% lm(y2 ~ x2) %>% coef
## (Intercept)          x2
##    3.000909    0.500000
anscombe %$% lm(y3 ~ x3) %>% coef
## (Intercept)          x3
##   3.0024545   0.4997273
anscombe %$% lm(y4 ~ x4) %>% coef
## (Intercept)          x4
##   3.0017273   0.4999091
Source: https://www.autodeskresearch.com/publications/samestats
Let’s work through a simple example with the anscombe data
fit <- anscombe %$% lm(y1 ~ x1)
summary(fit)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
y_hat <- fit %>% fitted.values()
The residuals (observed minus fitted values) are then calculated as

anscombe$y1 - y_hat

##           1           2           3           4           5           6
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364
##           7           8           9          10          11
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
If we introduce new values for the predictor x1, we can generate predicted values from the model
new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727
##         8         9        10        11        12        13        14
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364
##        15        16        17        18        19        20
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1
##   3.0000909   0.5000909
predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
Fit the model on each of the \(j\in\{1,\dots,k\}\) training sets and evaluate it on the corresponding test set
A small difference between training and test performance suggests good predictive accuracy
DAAG::CVlm(anscombe, fit, printit=F)
CV plot for the anscombe data:

DAAG::CVlm(anscombe, fit, plotit=T, printit=F)
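As an aside, for linear models LOOCV needs no refitting at all: the leave-one-out residual equals the ordinary residual divided by \(1 - h_i\), where \(h_i\) is the leverage (hat value) of case \(i\). A minimal sketch for the y1 ~ x1 fit:

```r
# LOOCV shortcut for OLS: e_i / (1 - h_i) is the leave-one-out prediction error
fit <- lm(y1 ~ x1, data = anscombe)
loo <- residuals(fit) / (1 - hatvalues(fit))
mean(loo^2)   # LOOCV mean squared prediction error, no refitting needed
```

This gives exactly the same errors as fitting the model \(n\) times and predicting each held-out case.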
boys.fit <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
is equivalent to
boys.fit <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
CV plot for the boys data:

DAAG::CVlm(na.omit(boys), boys.fit, plotit=T, printit=F)
set.seed(54321)
height <- rnorm(3e6, mean = 180, sd = 7.5)
shoesize <- 35 + 0.02*height + runif(3e6, -1, 1)
city <- sample(c("Rnhem", "AlkmR", "LeeuwRden"), size = 3e6,
prob = c(0.6, 0.3, 0.1), replace = TRUE)
sumR <- data.frame(height, shoesize, city)
par(mfrow = c(1,3))
hist(height, breaks = 80, col = "light green", border = "transparent")
hist(shoesize, breaks = 80, col = "light blue", border = "transparent")
table(city) %>% barplot(col = "pink")
(For a real country we would not know these things)
We randomly sample 100 persons and “measure” their height, shoesize and city:
samp100 <- sample(3e6, size = 100)
sdat100 <- sumR[samp100, ]
head(sdat100)
##           height shoesize      city
## 2908895 177.8716 39.28683     Rnhem
## 236907  170.2528 38.97513     Rnhem
## 1230916 195.9043 38.27168 LeeuwRden
## 1937234 177.4607 38.39331     Rnhem
## 2782743 172.3269 38.89885     AlkmR
## 2551759 169.6376 39.02313     Rnhem
Let’s use this sample to estimate quantities and infer things about our population!
We want to know the mean height of our population
The sample mean \(\bar{x}\) is an estimator for the population mean \(\mu\)
m <- mean(sdat100$height)
m
## [1] 179.7258
The simulated population has mean:
sumR$height %>% mean
## [1] 179.9988
This statistic has a sampling distribution: taking a different sample generates a different mean value
Thus we have uncertainty about our estimated mean
se <- sd(sdat100$height)/sqrt(100)
se
## [1] 0.6673471
Using the standard error and the assumption of normality, we can create a confidence interval (CI):
ci <- c(m - 1.96*se, m + 1.96*se)
means <- numeric(10000)
for (i in 1:10000) {
means[i] <- mean(sumR$height[sample(3e6, size = 100)])
}
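The standard deviation of such a vector of sample means approximates the true standard error \(\sigma/\sqrt{n} = 7.5/\sqrt{100} = 0.75\). A self-contained sketch (with its own seed and freshly simulated population, so the numbers differ slightly from the slides):

```r
# Sketch: the sd of many sample means approximates sigma / sqrt(n).
set.seed(1)
height <- rnorm(3e6, mean = 180, sd = 7.5)                  # simulated population
means  <- replicate(10000, mean(height[sample(3e6, 100)]))  # 10,000 sample means
sd(means)   # close to 7.5 / sqrt(100) = 0.75
```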
In the 1950s and 1960s, computers enabled another measure of uncertainty: the jackknife
jknf <- numeric(100)
for (i in 1:100) jknf[i] <- mean(sdat100$height[-i])
mean(jknf)
## [1] 179.7258
Similar to LOOCV!
The Jackknife can be used to estimate the standard error:
\[ \begin{align} \hat{se}_{j} &= \sqrt{\frac{n-1}{n}\sum_{i=1}^n\left(\hat{\theta}_{(i)}-\overline{\hat{\theta}_{(\cdot)}}\right)^2} \end{align}\]
jkse <- sqrt(99/100*sum((jknf - mean(jknf))^2))
## Jackknife estimate: 0.667
## Sample estimate:    0.6673471
## 'Truth':            0.750
With this, we can once again create a sampling distribution and a CI based on normal theory.
Introduced by Efron in 1979 as computers became more powerful.
btsp <- numeric(1000)
for (i in 1:1000) btsp[i] <- mean(sdat100$height[sample(100, replace = TRUE)])
We sample from our sample, pulling ourselves “up” into the space of population by our bootstraps
An analytical formula for the standard error does not need to be known
For some quantities, unreasonably large sample sizes are needed before the sampling distribution approaches normality (e.g. the product of two regression coefficients in mediation analysis)
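To illustrate the mediation remark, here is a hedged sketch: bootstrap a percentile interval for the product of two regression coefficients, for which normal-theory intervals behave poorly in small samples. The data, path coefficients, and sample size below are invented for the example.

```r
# Sketch: percentile bootstrap CI for an indirect effect a*b (invented data).
set.seed(2018)
n <- 50
x <- rnorm(n)
m <- 0.4 * x + rnorm(n)            # hypothetical mediator
y <- 0.4 * m + rnorm(n)            # hypothetical outcome
d <- data.frame(x, m, y)
ab <- replicate(1000, {
  di <- d[sample(n, replace = TRUE), ]          # resample cases
  a  <- coef(lm(m ~ x, data = di))["x"]         # path x -> m
  b  <- coef(lm(y ~ m + x, data = di))["m"]     # path m -> y, given x
  a * b
})
quantile(ab, c(0.025, 0.975))      # percentile CI for the indirect effect
```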
log(mean(sdat100$shoesize))
## [1] 3.653499
mystat <- function(dat, index) {
median(dat$height[index])/mean(dat$shoesize[index])
}
bootstat <- sdat100 %>% boot(mystat, R = 1000)
# SE
sd(bootstat$t)
## [1] 0.02899687
# 95% CI
bootstat$t %>% quantile(c(0.025, 0.975))
##     2.5%    97.5%
## 4.596886 4.698511
# Distribution
data.frame(mystat = bootstat$t) %>%
  ggplot(aes(x = mystat)) +
  geom_density(fill = "light seagreen") +
  xlim(4.5, 4.8) +
  theme_minimal()
boot can be used for bootstrapping

TIL that changing random stuff until your program works is “hacky” and “bad coding practice” but if you do it fast enough it's “#MachineLearning" and pays 4x your current salary
— Steve Maine (@smaine) May 10, 2018